Pruning in Decision Trees

Machine Learning Regression

Author

Dr Muhammad Saufi

Published

July 18, 2024

1 Definition

Pruning is a technique used in decision tree algorithms to remove sections of the tree that are not necessary for making accurate predictions. This helps in reducing the complexity of the model and improves its ability to generalize to new, unseen data. Pruning helps to prevent overfitting, which occurs when a model is too complex and captures noise in the training data instead of the underlying pattern.

2 Types of Pruning

  1. Pre-pruning
  2. Post-pruning

2.1 Pre-pruning

Pre-pruning involves stopping the growth of the decision tree early, before it becomes overly complex. This is done by setting certain criteria that, when met, will halt the further splitting of nodes. Common criteria include:

  1. Max Depth:

    • The maximum depth of the tree is limited. Depth is the length of the longest path from the root node to a leaf node.
    • Example: If the max depth is set to 3, the tree will not grow beyond three levels.
  2. Min Split:

    • The minimum number of samples that must be present in a node for a split to be attempted.
    • Example: If the min split is set to 5, a node can be further split only if it contains more than 5 samples.
  3. Min Bucket:

    • The minimum number of samples that must be present in any terminal (leaf) node.
    • Example: If the min bucket is set to 5, every leaf node must have at least 5 samples.

Advantages of Pre-pruning:

  • Reduces the complexity of the model.
  • Prevents overfitting early on.
  • Faster to compute as it stops the tree growth early.

Disadvantages of Pre-pruning:

  • It might lead to underfitting if the criteria are too strict, causing the model to be too simple to capture the underlying patterns in the data.

2.2 Post-pruning

Post-pruning involves allowing the tree to grow fully and then trimming back the overgrown sections. This is done by removing nodes that provide little to no improvement in the prediction accuracy. The process typically involves:

  • Growing the tree to its maximum extent, possibly overfitting the training data.
  • Pruning (cutting) the tree after it has been fully grown to remove nodes that do not provide significant improvements.

Advantages of Post-pruning:

  • It ensures that the most complex model is initially created and then simplified, retaining only the necessary parts.
  • Often results in a more optimal tree compared to pre-pruning.

Disadvantages of Post-pruning:

  • More computationally intensive as it involves growing the full tree.
  • Requires a validation set to determine which parts of the tree to prune.

3 Cost Complexity Pruning

Cost Complexity Pruning is a specific method used in post-pruning. It simplifies the tree by balancing the trade-off between the complexity of the tree and its accuracy on the training data. This is done using a cost complexity parameter (α), which controls how much penalty is applied for having a more complex tree.

  1. Calculate the Cost Complexity:

For each possible tree, calculate the cost complexity function:

\[ R(T) = R_T + \alpha |T| \]

  • \(R_T\): Error of the tree \(T\) on the training data.
  • \(|T|\): Number of terminal nodes (leaves) in the tree.
  • \(\alpha\): A positive constant that determines the penalty for complexity.
  1. Prune the Tree:

    • Gradually increase α and prune the tree by removing nodes that do not significantly reduce the error.
    • The goal is to find the tree structure that minimizes the cost complexity function.
  2. Select the Optimal Tree:

    • The tree with the minimum cost complexity is selected as the final model.

Advantages of Cost Complexity Pruning:

  • Provides a systematic way to balance tree complexity and prediction accuracy.
  • Results in a simpler and more interpretable model.

Disadvantages of Cost Complexity Pruning:

  • Requires careful selection of the α parameter.
  • Can be computationally intensive due to the need to evaluate multiple tree structures.

4 Summary

  • Pruning helps reduce the complexity of decision trees and prevents overfitting.
  • Pre-pruning stops the tree growth early based on certain criteria, while post-pruning involves growing a full tree and then trimming it.
  • Cost Complexity Pruning is a method used in post-pruning that balances the complexity and accuracy of the tree using a cost complexity parameter (α).
 

A dedicated work by Dr Muhammad Saufi